Q($\lambda$) with Off-Policy Corrections

Authors

  • Anna Harutyunyan
  • Marc G. Bellemare
  • Tom Stepleton
  • Remi Munos
Abstract

We propose and analyze an alternative approach to off-policy multi-step temporal difference learning, in which off-policy returns are corrected with the current Q-function in terms of rewards, rather than with the target policy in terms of transition probabilities. We prove that such approximate corrections are sufficient for off-policy convergence in both policy evaluation and control, under certain conditions. These conditions relate the distance between the target and behavior policies, the eligibility trace parameter, and the discount factor, and they formalize an underlying tradeoff in off-policy TD(λ). We illustrate this theoretical relationship empirically on a continuous-state control task.
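The reward-based correction described in the abstract can be illustrated with a minimal tabular sketch. It assumes the corrected return takes the form of the current estimate Q(x_0, a_0) plus a (λγ)-discounted sum of TD errors, where each error bootstraps from the expectation of Q under the target policy; no importance-sampling ratios appear. The function name `q_lambda_return`, the dictionary layout of `q` and `pi`, and the toy trajectory are all hypothetical choices made for illustration and do not come from the abstract.

```python
def q_lambda_return(q, pi, trajectory, gamma, lam):
    """Sketch of a Q-function-corrected multi-step return (assumed form).

    q:          dict mapping state -> list of action values (current Q estimate)
    pi:         dict mapping state -> list of target-policy action probabilities
    trajectory: list of (state, action, reward, next_state) tuples generated
                by some behavior policy
    gamma:      discount factor
    lam:        eligibility trace parameter
    """
    first_state, first_action, _, _ = trajectory[0]
    total = q[first_state][first_action]
    weight = 1.0  # (lam * gamma) ** t, updated step by step
    for state, action, reward, next_state in trajectory:
        # Expected next-state value under the target policy.
        expected_next = sum(p * v for p, v in zip(pi[next_state], q[next_state]))
        # The correction is expressed in rewards: a TD error, not a probability ratio.
        delta = reward + gamma * expected_next - q[state][action]
        total += weight * delta
        weight *= lam * gamma
    return total


# Toy example: one non-terminal state "s0" and a terminal state "T" with zero values.
q = {"s0": [0.0, 1.0], "T": [0.0, 0.0]}
pi = {"s0": [0.5, 0.5], "T": [1.0, 0.0]}
episode = [("s0", 0, 1.0, "s0"), ("s0", 1, 2.0, "T")]
print(q_lambda_return(q, pi, episode, gamma=0.9, lam=0.5))
```

With `lam=0` the sketch reduces to a one-step expected backup, and larger values of `lam` put more weight on later TD errors from the behavior trajectory, which is where the tradeoff mentioned in the abstract would arise.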

Publication date: 2016